A wide range of natural and social phenomena result in observables whose distributions can be well approximated by a power-law decay. The well-known Hill estimator of the tail exponent provides results which are in many respects superior to those of other estimators if the asymptotics of the distribution is indeed a pure power law; however, systematic errors occur if the distribution is altered by simply shifting it. We demonstrate some related problems which typically emerge when dealing with empirical data and suggest a procedure designed to extend the applicability of the Hill estimator.
I. INTRODUCTION

Heavy-tailed distributions emerge in many situations, with examples ranging from social networks to earthquake intensities and city sizes (for further examples, see [1] and references therein). In mathematical terms, a random variable $X$ has a heavy upper tail if the probability of exceeding $x$ decays as

$$P(X > x) \propto x^{-\alpha} L(x), \tag{1}$$

with $\alpha > 0$ and $L(x)$ being a slowly varying function of its argument. The function $F(x) = P(X \ge x)$ is called the (complementary) cumulative distribution function (cdf in the following). The exponent $\alpha$, termed the tail exponent, is a parameter of practical importance, since this is the quantity which determines the frequency of extreme events (e.g., huge losses on the stock market).

The general problem can be formulated as follows: given some finite sample $S = \{X_1, X_2, \dots, X_N\}$ of independent, identically distributed elements whose distribution can be described by Eq. (1), we intend to find an efficient procedure to estimate the tail exponent. There exist many such estimators, each with its own advantages and drawbacks. The difficulties generally lie in the following:

• The small-$x$ form of the cdf, which in Eq. (1) is incorporated in $L(x)$, can shorten the "effective tail length", i.e., the domain where the distribution is close to a power law. Therefore, the actual form of $L(x)$ affects the speed of convergence of any estimator.

• If $\alpha$ is relatively large, one needs a huge dataset to have enough points in the tail.

• Linear transformations of a random variable do not affect its asymptotic behavior (i.e., $\alpha$), but can affect the value of an estimator over a finite sample.

The popular Hill estimator [2] (HE in the following), $\hat\alpha_H(S, n)$, is based on the $n$ largest observations in the sample and is defined by Eq. (2), with $X_{(1)} \ge X_{(2)} \ge \dots \ge X_{(N)}$ being elements of the order statistics. Note that Eq. (2) is invariant to multiplication ($\hat\alpha_H(a \cdot S, n) = \hat\alpha_H(S, n)$), yet not shift-invariant ($\hat\alpha_H(S + s, n) \ne \hat\alpha_H(S, n)$). For a fixed $n$, the HE is a maximum likelihood estimator of the tail exponent, but the appropriate choice of the tail length $n$ remains an issue, since $\hat\alpha_H$ is typically very sensitive to it. The standard way to determine the threshold $x_0 = X_{(n)}$ is to construct a so-called Hill plot, which is $1/\hat\alpha_H$ as a function of $n$, and look for a plateau in the graph (for other evaluation methods, see [3]).
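For reference, a minimal Python sketch of a Hill-type estimator follows. Conventions differ between authors, so the normalization used here is an assumption: the threshold is taken as $x_0 = X_{(n)}$ and the sum runs over the $n - 1$ observations above it, i.e. $\hat\alpha_H = (n - 1) / \sum_{i<n} \ln(X_{(i)} / X_{(n)})$. The names `hill_estimator` and `pareto_sample` are illustrative and do not come from the original text. The example checks numerically that the estimate is unchanged under multiplication but changes under a shift.

```python
import math
import random

def hill_estimator(sample, n):
    """Hill-type estimate of the tail exponent from the n largest observations.

    Assumed convention: threshold x0 = X_(n) (the n-th largest value) and
    normalisation by n - 1, since the term i = n contributes log(1) = 0.
    """
    if not 2 <= n <= len(sample):
        raise ValueError("need 2 <= n <= len(sample)")
    tail = sorted(sample, reverse=True)[:n]
    x0 = tail[-1]
    if x0 <= 0:
        raise ValueError("the n largest observations must be positive")
    log_sum = sum(math.log(x / x0) for x in tail[:-1])
    return (n - 1) / log_sum

def pareto_sample(alpha, size, rng):
    """Pure power law with P(X >= x) = x**(-alpha) for x >= 1 (inverse transform)."""
    return [(1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(size)]

rng = random.Random(0)
data = pareto_sample(alpha=2.0, size=5000, rng=rng)

base = hill_estimator(data, 1000)                          # original sample
scaled = hill_estimator([3.0 * x for x in data], 1000)     # multiplied sample
shifted = hill_estimator([x + 1.0 for x in data], 1000)    # shifted sample
print(base, scaled, shifted)
```

The scaled estimate coincides with the original up to rounding, because only ratios $X_{(i)}/X_{(n)}$ enter. For positive data, a positive shift increases the estimate, since each ratio $(X_{(i)} + s)/(X_{(n)} + s)$ is closer to one than $X_{(i)}/X_{(n)}$.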

In a recent publication [4], Clauset et al. suggest a procedure (CSNE in the following) to solve this problem, i.e., to find the optimal $n$ in an automated fashion. This method provides superior results if $L(x) = \text{const.}$, but inherits the sensitivity of the Hill estimator to the actual form of $L(x)$.

The shifted Pareto tail [Eq. (3)] presents a special case of Eq. (1):

$$F(x) \propto (x + s)^{-\alpha} = x^{-\alpha} \cdot (1 + s/x)^{-\alpha}. \tag{3}$$

Although, at first sight, the introduction of data shifts does not seem a drastic change, it limits the applicability of both the HE and the CSNE procedures.
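To make the role of the shift explicit, the factor $(1 + s/x)^{-\alpha}$ of Eq. (3) can be expanded for $|s| < x$ using the binomial series. The tolerance $\varepsilon$ used below is an illustrative symbol and not part of the original text.

$$(1 + s/x)^{-\alpha} = 1 - \alpha \frac{s}{x} + \frac{\alpha(\alpha + 1)}{2} \frac{s^2}{x^2} - \dots$$

To leading order, the relative deviation from the pure power law is $\alpha |s| / x$, so it stays below $\varepsilon$ only for $x \gtrsim \alpha |s| / \varepsilon$. A larger shift or a larger exponent therefore pushes the region where the pure power law is a good approximation further out into the tail, where fewer observations are available.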

One method to tackle the problem of data shifts is the Fraga Alves estimator [5], which is invariant to both shifts and multiplication of the dataset, at the cost of slow convergence. Another example is the Meerschaert-Scheffler estimator [6], which is shift-independent but not invariant to multiplication; furthermore, its applicability is restricted to $\alpha < 2$.

The aim of the present work is to show an extension of the Hill estimator which can handle both the threshold and the shift problem, in a procedure similar to CSNE. The paper is organized as follows: In Section II we analyze the systematic errors introduced by the shift in the distribution and demonstrate how this can be taken into account in the estimator. Section III demonstrates the performance of the suggested method on computer-generated data. In Section IV we present an analysis of empirical data taken from traded volumes of stocks on the stock market. In Section V we give the conclusions; the Appendix briefly summarizes the CSNE algorithm [4].

Figure 1: Effective tail length as a function of data shift. Data details: $\alpha = 2.36$ pure power law; shifts are given in units of the mean absolute deviation ($E(|X - E(X)|) \approx 0.7$). For each shift, 100 samples of size 10000 were generated; the cyan points correspond to the result of the fitting procedure on each of those. The blue curve corresponds to the mean of the 100 runs at each shift. (Color online.)

Figure 2: Estimated exponents as a function of data shift, same setup as in Fig. 1. The blue curve corresponds again to the average of the estimates. Note that shifts result not only in a shorter tail, and thus a larger standard deviation, but also in a shift of the mean estimate. (Color online.)



II. OPTIMIZING THE SHIFT

In short, the CSNE algorithm [4] optimizes the parameter $n$ of Hill's estimator so that the distance between the fitted power law and the conditional cdf is minimal (for a summary in terms of formulas, see the Appendix). Clauset et al. suggest the Kolmogorov-Smirnov (KS in the following) statistic ($D_{KS}(F, G) = \sup_x |F(x) - G(x)|$) as the definition of distance. This method has proven to be a useful tool; however, tests on shifted power-law samples [Eq. (3)] show that in that case it provides biased results (Figs. 1 and 2). While shifts are irrelevant asymptotically (if $|s| \ll x$, then $(x + s)^{-\alpha} \approx x^{-\alpha}$), for a finite dataset they result in a shorter "effective tail length" (i.e., the threshold above which the approximation $(x + s)^{-\alpha} \approx x^{-\alpha}$ is valid becomes higher). This observation explains why, with growing shift, the CSNE procedure provides more and more volatile results (Fig. 2). The average estimate deviates from the zero-shift value since, in the shifted case, the cdf is deterministically either convex or concave, i.e., the estimator is bound to deviate from the true value in a fixed direction. Thus, the task is to optimize the tail length $n$ and the shift parameter $s$ simultaneously. If this is achieved, the Hill formula can be applied to the $n$ largest observations shifted by the previously obtained value of $s$.

In the simplest case, let us assume that the whole sample is taken from a shifted power-law distribution, i.e., we do not have to deal with estimating the tail length. Our aim is to find a shift estimator $\hat s(S)$ for which $S' = S + \hat s(S)$ is well approximated by a pure, non-shifted Pareto law. Note that, from the practitioner's point of view, $\hat s$ does not necessarily need to be very accurate since, as Fig. 2 shows, the mean estimate depends smoothly on the shift and has a small standard deviation in the vicinity of zero shift.

In geometric terms, we have to "straighten out" the cdf plot, i.e., to determine the shift so that the cdf of $S'$ on a doubly logarithmic plot is as close to a straight line as possible. The simplest way to achieve this is to minimize the mean squared error of the linear fit on the log-log plot (via numerical optimization, e.g., the golden section method [7]). Figure 3 and Table I show the performance of the latter procedure on computer-generated shifted Pareto samples. It can be concluded that although this type of estimator slightly underestimates the shift on average, it provides reasonable results.

Table I: As the table shows, the procedure depends on $\alpha$: its bias and standard deviation both increase slightly with growing $\alpha$. (The averages were taken over all shifts for a fixed exponent, i.e., 19x1000 trials with sample size $N = 10^4$.)
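The "straightening" idea for the shift estimator can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the empirical ccdf is taken as $(n - i)/n$ at the $i$-th smallest point, the line is fitted by ordinary least squares, and the search bracket `lo`, `hi` must be supplied by the caller. The function names are hypothetical. Golden section search assumes the mean squared error has a single minimum inside the bracket.

```python
import math

def loglog_fit_mse(sample, shift):
    """Mean squared residual of a straight-line fit to the log-log ccdf of sample + shift."""
    xs = sorted(x + shift for x in sample)
    if xs[0] <= 0.0:
        return math.inf  # the logarithm needs strictly positive shifted data
    n = len(xs)
    u = [math.log(x) for x in xs]
    v = [math.log((n - i) / n) for i in range(n)]  # empirical P(X >= x_(i))
    mu, mv = sum(u) / n, sum(v) / n
    suu = sum((a - mu) ** 2 for a in u)
    slope = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / suu
    return sum((b - mv - slope * (a - mu)) ** 2 for a, b in zip(u, v)) / n

def golden_section_min(f, lo, hi, iters=60):
    """Golden section search for the minimiser of a unimodal f on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc < fd:
            hi, d, fd = d, c, fc
            c = hi - g * (hi - lo)
            fc = f(c)
        else:
            lo, c, fc = c, d, fd
            d = lo + g * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)

def estimate_shift(sample, lo, hi):
    """Shift that makes the log-log ccdf of sample + shift as straight as possible."""
    return golden_section_min(lambda s: loglog_fit_mse(sample, s), lo, hi)

# Noise-free illustration: exact quantiles of a pure power law with alpha = 2,
# moved by -0.5, so that adding a shift of +0.5 restores the pure power law.
alpha, true_shift, n = 2.0, 0.5, 500
pure = [((n - i) / n) ** (-1.0 / alpha) for i in range(n)]
data = [y - true_shift for y in pure]
s_hat = estimate_shift(data, lo=-min(data) + 1e-6, hi=5.0)
print(s_hat)
```

On this noise-free input the log-log ccdf is exactly linear at the true shift, so the search should return a value very close to 0.5. On random samples the estimate fluctuates from sample to sample.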



Figure 3: The figure shows the shift estimator's output on computer-generated shifted power-law samples ($\alpha = 2.0$, $N = 10^4$); at each shift value, 100 trials were performed. Note that the procedure is definitely shift-invariant, as intended. (Color online.)

Back to the general case, where the tail length can be smaller than the sample size, one has to estimate $n$ along with the shift $s$. A consistent and simple way to incorporate the shift estimator in a procedure in the manner of [4] is to estimate at each tail length $n$ the shift $\hat s$ based on the $n$ largest entries of the sample. Having obtained $\hat s$, one has to shift the tail by this value and calculate the Hill estimator using the shifted tail. Thus, we obtain $(n_i, \hat s_i, \hat\alpha_i)$, $i = 1, 2, \dots, K \le N$, and from these we accept the one which is the closest to the empirical cdf according to the KS statistic. So the "recipe" is the following:

1. For each tail length $n$, calculate $\hat s(T_n)$ (with $T_n$ denoting the $n$ largest elements in the sample),

2. calculate $\hat\alpha_H(S + \hat s(T_n), n)$,

3. calculate the KS distance between the cdf of $T_n$ and $\left(\frac{x + \hat s}{x_0 + \hat s}\right)^{-\hat\alpha_H}$, with $x_0 = X_{(n)}$, as previously.

4. Accept the fit with the lowest KS statistic.
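The four steps above can be sketched as follows. This is a self-contained illustration under several assumptions that are not in the original text: the shift at each tail length is found by golden section search on the mean squared error of the log-log line fit, the search bracket is tied to the range of the tail (a hypothetical choice, so a true shift outside it cannot be found), the Hill normalization is $n - 1$, and the list of tail lengths is supplied by the caller. All function names are illustrative.

```python
import math

def fit_mse(tail, shift):
    """MSE of a straight-line fit to the log-log conditional ccdf of an ascending tail."""
    n = len(tail)
    u = [math.log(x + shift) for x in tail]
    v = [math.log((n - i) / n) for i in range(n)]
    mu, mv = sum(u) / n, sum(v) / n
    suu = sum((a - mu) ** 2 for a in u)
    slope = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / suu
    return sum((b - mv - slope * (a - mu)) ** 2 for a, b in zip(u, v)) / n

def golden_min(f, lo, hi, iters=50):
    """Golden section search; assumes f is unimodal on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc < fd:
            hi, d, fd = d, c, fc
            c = hi - g * (hi - lo)
            fc = f(c)
        else:
            lo, c, fc = c, d, fd
            d = lo + g * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)

def ks_distance(tail, shift, alpha):
    """KS distance between the empirical ccdf of the tail and the shifted power-law fit."""
    n, x0 = len(tail), tail[0]
    worst = 0.0
    for i, x in enumerate(tail):
        model = ((x + shift) / (x0 + shift)) ** (-alpha)
        worst = max(worst, abs(model - (n - i) / n), abs(model - (n - i - 1) / n))
    return worst

def shifted_hill_fit(sample, tail_lengths):
    """Return (ks, n, shift, alpha) with the lowest KS distance over the given tail lengths."""
    ordered = sorted(sample)
    best = None
    for n in tail_lengths:
        tail = ordered[-n:]                  # T_n in ascending order; tail[0] is x0 = X_(n)
        x0, spread = tail[0], tail[-1] - tail[0]
        lo = -x0 + 1e-3 * spread             # keeps x0 + shift strictly positive
        shift = golden_min(lambda s: fit_mse(tail, s), lo, lo + spread)   # step 1
        log_sum = sum(math.log((x + shift) / (x0 + shift)) for x in tail[1:])
        alpha = (n - 1) / log_sum            # step 2: Hill estimate on the shifted tail
        ks = ks_distance(tail, shift, alpha) # step 3
        if best is None or ks < best[0]:
            best = (ks, n, shift, alpha)     # step 4: keep the lowest KS distance
    return best

# Noise-free illustration: exact quantiles of a power law with alpha = 2, moved by -0.5.
N = 500
data = [((N - i) / N) ** (-0.5) - 0.5 for i in range(N)]
tail_lengths = [50, 100, 200, 300, 400, 500]
ks, n_best, shift_hat, alpha_hat = shifted_hill_fit(data, tail_lengths)
print(ks, n_best, shift_hat, alpha_hat)
```

Re-estimating the shift separately for every tail length keeps the procedure in the spirit of the recipe, at the cost of one one-dimensional optimization per candidate $n$.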

The set of tail lengths $n$ to test is chosen in the same manner as in the CSNE procedure (see the Appendix). Sections III and IV analyze the capabilities of this procedure on computer-generated and empirical data. When considering empirical data, one cannot assume that the cdf has exactly the form of Eq. (3), but rather a variant of Eq. (1):

$$F(x) \propto (x + s)^{-\alpha} \cdot \tilde L(x) = x^{-\alpha} \cdot L(x). \tag{4}$$

Although the difference between Eqs. (1) and (4) is only in grouping, it can pay off in case $\tilde L(x)$ is closer to a constant than $L(x)$.



III. SIMULATION RESULTS

The procedure was tested on computer-generated datasets of size $N = 10^4$ consisting of independent elements distributed according to a shifted power law:

$$P(X_i \ge x) \propto \begin{cases} (x + s)^{-\alpha} & \text{if } x \ge 1, \\ 1 & \text{otherwise,} \end{cases} \tag{5}$$

for $i = 1, 2, \dots, N$. The parameters $s$ and $\alpha$ were varied in the range

$$s \in [-0.9, 0.9], \qquad \alpha \in \{1.5, 2, 2.5\}.$$

Table II: Average and standard deviation of the exponents estimated with the new method.

Figure 4: Comparison of the CSNE estimator and the method introduced in Section II. Note that the standard deviation obtained using this new procedure is larger than that of the CSNE at zero shift; this is the price of shift-independence. (Color online.)



Figures 4-5 show the output of the procedure introduced in Section II for datasets with a fixed exponent $\alpha = 1.5$. One can conclude that the method accounts for the shift problem, although at the price of an increased standard deviation relative to the CSNE zero-shift case. The estimates of the shift and the tail length are not as accurate as those of the exponent, but this is not surprising, as the tail exponent is relatively stable to small changes in both the shift and the tail length.

Table II shows the performance of the new method for different $\alpha$ values; the averages comprise the estimates for all shift values considered. The accuracy gets worse with increasing exponent; this is no surprise, since the shift estimator and the Hill estimator both display this property.



Figure 5: The estimated tail sizes in the procedure introduced in Section II. (Color online.)

IV. EMPIRICAL DATA

As an application, we use the method to determine tail exponents of stock market data. The dataset considered was that of the dollar value [13] of individual transactions of the 1000 largest stocks (according to the total number of trades in the period) in 1994-95 at the New York Stock Exchange [8]. We wish to emphasize, however, that as there is no firm theoretical evidence supporting the power-law property of the distribution of trade sizes, we see it only as one possibility to model the tail of the cdf. Fig. 6 compares the histograms of the tail exponent estimates obtained using the CSNE estimator and the new, shift-invariant procedure. Fig. 7 shows the fits provided by the two estimators on a stock for which they provide very different estimates.

Figure 6: Comparison of the CSNE and the new procedure on empirical data. The black bars correspond to the histogram of the CSNE estimates, the red dashed bars to that of the new one. (Color online.)

The new procedure found fits with a KS distance on average 23% lower than that of the CSNE estimator. This is not a drastic difference, but on a logarithmic scale, the KS statistic is less sensitive in the tail than for small $x$ values. Fig. 7 demonstrates this in the case of a specific stock (Kmart Corporation): although on a linear scale (upper part) the two fits do not seem to be very different, the logarithmic plot (lower part) shows that the two deviate in the tail region. In this specific case, the KS distance improved from 0.035 (CSNE) to 0.021 (new fit).

In empirical data, an additional problem has to be considered when analyzing the results. In case there is no strong theoretical indication for Eq. (4), this type of ansatz can only be accepted if there are at least some datapoints in the tail, i.e., for which the approximation

$$\frac{(x + s)^{-\alpha}}{x^{-\alpha}} \approx 1 \tag{6}$$

is applicable. This is of importance because fitting an exponent to a non-observed tail is questionable, and even if there is theoretical evidence for a shifted power-law tail, errors are amplified. In other words, the quantity $\delta$ of Eq. (7) (with $x_{\max} = \max_i \{X_i \in S\}$), which measures the "distance" of the largest observation from the tail region, has to be small. For the data analyzed, about 10% of the stocks had $\delta \ge 0.1$. If we exclude these data from the statistic, the average exponent is 2.0 with a standard deviation of 0.35 [14].

One can conclude that in the case of stock trade volumes, the inclusion of data shifts clearly has an effect on the results. Furthermore, note that the typical value of the estimates is approximately 2, i.e., on the boundary of the Lévy regime. This matter has been controversial, and our present results support the view that the exponents are higher than thought earlier (Refs. [9], [10], [11] and [12]). Furthermore, since the estimates have a large standard deviation, we find that the term universality is not applicable to trade volume distributions.

Figure 7: The cdf of trade sizes for the shares of the Kmart Corporation (cyan crosses). The red and black curves show the fits provided by the two estimation procedures. The upper and lower plots differ only in the scale of the vertical axis. (Color online.)

V. CONCLUSIONS

In this paper, we have shown for empirical as well as computer-generated data that data shifts can play an important role in the tail exponent estimation procedure. Tests on computer-generated datasets showed that this problem can be solved by the suggested extension of the Hill estimator. A small, yet systematic underestimation of the exponent is present; nevertheless, on empirical data this bias loses its importance when compared to other error sources. Our results regarding stock market data lead us to doubt the idea of universal tail exponents (Refs. [9]-[10]), regarding both universality and typical values.

Appendix: THE CSNE ALGORITHM [4]

The CSNE method optimizes the tail length parameter of the Hill estimator: the algorithm calculates $\hat\alpha_H(S, n)$ for many different values of $n$ and accepts the fit which is the closest to the empirical cdf according to the Kolmogorov-Smirnov statistic. In terms of formulas, the procedure for a given set $S = \{X_1, X_2, \dots, X_N\}$ of observations can be described as follows:



1. Sorting:

$$\{X_1, \dots, X_N\} \to \{X_{(1)}, \dots, X_{(N)}\}: \quad X_{(1)} \ge X_{(2)} \ge \dots \ge X_{(N)}$$

2. Choose the set $C \subset \{1, 2, \dots, N\}$ of tail lengths to check, according to the following criteria:

• Do not include $n \le n_{\min}$, because it does not make any sense to calculate the Hill estimator based on, e.g., 5 elements.

• If $N - n_{\min} + 1 > M$, choose $M$ elements from $\{n_{\min}, n_{\min} + 1, \dots, N\}$ uniformly to obtain $C$.

3. Distance: Kolmogorov-Smirnov statistic:

$$F_x(x_i) = P(X \ge x_i \mid X \ge x)$$

$$D_{KS}(x) = \max_{x_i > x} \left| F_x(x_i) - \left(\frac{x_i}{x}\right)^{-\hat\alpha_H} \right| \tag{A.1}$$

4. For all $n \in C$, calculate the Hill estimator $\hat\alpha_H(S, n)$ and the distance between the Hill fit and the empirical distribution function, $D_{KS}(X_{(n)})$.

5. Accept the tail length which minimizes the distance:

$$n_{\mathrm{tail}} = \arg\min_{n \in C} D_{KS}(X_{(n)})$$

6. As a result, we obtain:

• $\hat\alpha = \hat\alpha_H(S, n_{\mathrm{tail}})$

• $x_0 = X_{(n_{\mathrm{tail}})}$

• $d = D_{KS}(x_0)$
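The steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the reference implementation of [4]: "uniformly" is read here as uniform random sampling of the candidate tail lengths (evenly spaced values would be another reading), candidates start at $n_{\min} + 1$, the Hill normalization is $n - 1$, and the default values of `n_min`, `M` and `seed` are arbitrary. All function names are hypothetical.

```python
import math
import random

def candidate_tail_lengths(N, n_min, M, rng):
    """Tail lengths n_min < n <= N; at most M of them, drawn uniformly at random."""
    candidates = list(range(n_min + 1, N + 1))
    if len(candidates) > M:
        candidates = sorted(rng.sample(candidates, M))
    return candidates

def hill(desc, n):
    """Hill-type estimate from the n largest values; desc is sorted in descending order."""
    x0 = desc[n - 1]
    return (n - 1) / sum(math.log(x / x0) for x in desc[:n - 1])

def ks_distance(desc, n, alpha):
    """Distance between the conditional ccdf above x0 = X_(n) and (x / x0) ** (-alpha)."""
    x0 = desc[n - 1]
    worst = 0.0
    for k in range(n):                      # desc[k] is the (k + 1)-th largest value
        model = (desc[k] / x0) ** (-alpha)
        worst = max(worst, abs(model - (k + 1) / n), abs(model - k / n))
    return worst

def csne(sample, n_min=10, M=50, seed=0):
    """Return (alpha_hat, x0, d, n_tail) for the tail length with the lowest KS distance."""
    desc = sorted(sample, reverse=True)     # step 1: sorting
    rng = random.Random(seed)
    best = None
    for n in candidate_tail_lengths(len(desc), n_min, M, rng):   # step 2
        alpha = hill(desc, n)               # step 4: Hill estimate ...
        d = ks_distance(desc, n, alpha)     # ... and its KS distance (step 3)
        if best is None or d < best[2]:
            best = (alpha, desc[n - 1], d, n)   # step 5: keep the minimum
    return best                             # step 6

# Noise-free illustration: exact quantiles of a pure power law with alpha = 2.
N = 500
sample = [((N - i) / N) ** (-0.5) for i in range(N)]
alpha_hat, x0, d, n_tail = csne(sample)
print(alpha_hat, x0, d, n_tail)
```

The sketch requires strictly positive data, because the Hill estimate takes logarithms of ratios to the threshold.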